1 Introduction

Although nearly forty years old, Ziv and Lempel’s LZ77 [5] is still one of the 
best algorithms for compressing repetitive datasets such as genomic databases, 
software repositories and versioned text. The LZ77 parse of a string can also be 
used to speed up online pattern-matching and to reduce the size of indexes; see, 
e.g., [1]. When the dataset is too large to fit in internal memory, however, computing
the parse is a challenge. The best known external-memory algorithm [5]
for computing the parse exactly takes $O(n^2/(MB))$ I/Os, where $M$ is the size
of the internal memory and $B$ is the size of a disk block. There are more efficient
external-memory algorithms that can be used to approximate the parse — see, 
e.g., g] — but the known approximation ratios are at least logarithmic. 

In this paper we describe a randomized algorithm with which, given a positive
$\epsilon < 1$ and read-only access to a string $S[1..n]$ whose LZ77 parse consists of $z$
phrases, with high probability we can build an LZ77-like parse of $S$ that consists
of $O(z/\epsilon)$ phrases using $O(n^{1+\epsilon})$ time, $O(n^{1+\epsilon}/B)$ I/Os and $O(z/\epsilon)$ space. By
“LZ77-like” we mean that, first, the concatenation of the phrases is $S$; second,
for each phrase $S[i..j]$, either $i = j$, in which case we encode $S[i]$ as $(0, S[i])$, or
$S[i..j]$ first occurs at position $i'$ in $S[1..j-1]$, in which case we encode $S[i..j]$ as
$(i', j - i + 1)$.
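To make the encoding concrete, the following is a minimal sketch of a decoder, assuming phrases are given as Python tuples `(0, c)` or `(i', length)` with 1-based positions as in the definition above. The function name `decode` is ours and is not part of the algorithm.

```python
def decode(phrases):
    """Rebuild S from an LZ77-like encoding.

    Each phrase is either (0, c) for a single character c, or
    (src, length) where src is the 1-based position of an earlier
    occurrence of the phrase.
    """
    out = []
    for first, second in phrases:
        if first == 0:
            # Literal phrase: second is the character itself.
            out.append(second)
        else:
            # Copy one character at a time, so a source that overlaps
            # the phrase being written is still handled correctly.
            for k in range(second):
                out.append(out[first - 1 + k])
    return "".join(out)
```

For instance, `decode([(0, "a"), (0, "b"), (1, 2)])` rebuilds the string `abab`.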

2 Algorithm 

Assume we have an easy way to determine where substrings of lengths $n, f(n),
f(f(n)), \ldots, 2$ first occur in $S$, where $f(\ell) = \min(\lceil \ell (1 - 1/n^\epsilon) \rceil, \ell - 1)$. Then we
can approximate the LZ77 parse by calling parse$(1, n, n)$, where parse$(i, j, \ell)$ is
the following recursive procedure:

1. if $\ell = 1$ then we encode $S[i], \ldots, S[j]$ as $(0, S[i]), \ldots, (0, S[j])$ and return;

2. if $j - i + 1 < \ell$ then we call parse$(i, j, f(\ell))$ and return;

3. if $S[j - \ell + 1..j]$ first occurs at position $i'$ in $S[1..j - 1]$, then we call
parse$(i, j - \ell, \ell)$, encode $S[j - \ell + 1..j]$ as $(i', \ell)$ and return;

4. for $k$ from 0 to $\lfloor (j - i + 1)/\ell \rfloor - 1$, if $S[i + k\ell..i + (k + 1)\ell - 1]$ first occurs
at position $i'$ in $S[1..i + (k + 1)\ell - 2]$, then we call parse$(i, i + k\ell - 1, f(\ell))$,
encode $S[i + k\ell..i + (k + 1)\ell - 1]$ as $(i', \ell)$, call parse$(i + (k + 1)\ell, j, \ell)$, and
return;

5. if none of the conditions above are met, then we call parse$(i, j, f(\ell))$ and
return.


For example, suppose $S$ = ababbabbaabbabbaababa and $n^\epsilon = 4$, so $f(\ell) =
\min(\lceil 3\ell/4 \rceil, \ell - 1)$:

1. parse(1, 21, 21) calls parse(1, 21, 16);

2. parse(1, 21, 16) calls parse(1, 21, 12);

3. parse(1, 21, 12) calls parse(1, 21, 9);

4. parse(1, 21, 9) calls parse(1, 9, 9),
encodes S[10..18] as (3, 9),
and calls parse(19, 21, 9);

5. parse(1, 9, 9) calls parse(1, 9, 7);

6. parse(1, 9, 7) calls parse(1, 9, 6);

7. parse(1, 9, 6) calls parse(1, 9, 5);

8. parse(1, 9, 5) calls parse(1, 4, 5),
and encodes S[5..9] as (2, 5);

9. parse(1, 4, 5) calls parse(1, 4, 4);

10. parse(1, 4, 4) calls parse(1, 4, 3);

11. parse(1, 4, 3) calls parse(1, 4, 2);

12. parse(1, 4, 2) calls parse(1, 2, 2)
and encodes S[3..4] as (1, 2);

13. parse(1, 2, 2) calls parse(1, 2, 1);

14. parse(1, 2, 1) encodes S[1] as (0, a),
and encodes S[2] as (0, b);

15. parse(19, 21, 9) calls parse(19, 21, 7);

16. parse(19, 21, 7) calls parse(19, 21, 6);

17. parse(19, 21, 6) calls parse(19, 21, 5);

18. parse(19, 21, 5) calls parse(19, 21, 4);

19. parse(19, 21, 4) calls parse(19, 21, 3);

20. parse(19, 21, 3) encodes S[19..21] as (1, 3).

Thus, our parse is 

a, b, ab, babba, abbabbaab, aba, 

encoded as 

(0, a), (0, b), (1, 2), (2, 5), (3, 9), (1, 3),
while the true LZ77 parse is 

a, b, abb, abbaa, bbabbaaba, ba . 
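The recursive procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the external-memory algorithm itself: first occurrences are found naively with `str.find` instead of fingerprints, $n^\epsilon$ is passed as an integer `n_eps` of at least 2, and the guard for empty intervals is our addition. The names `approx_parse` and `first_occurrence` are ours.

```python
def approx_parse(s, n_eps):
    """Return the LZ77-like parse of s as (0, char) or (pos, length) pairs.

    n_eps plays the role of n^epsilon (assumed to be an integer >= 2).
    Positions are 1-based, as in the text.
    """
    def f(l):
        # min(ceil(l * (1 - 1/n_eps)), l - 1), in integer arithmetic.
        return min(-(-l * (n_eps - 1) // n_eps), l - 1)

    def first_occurrence(start, l):
        # Leftmost occurrence of s[start..start+l-1]; report it only if it
        # begins before start. Naive stand-in for the fingerprint search.
        pos = s.find(s[start - 1:start - 1 + l]) + 1
        return pos if pos < start else None

    out = []

    def parse(i, j, l):
        if i > j:                      # empty interval (guard added here)
            return
        if l == 1:                     # step 1: single characters
            out.extend((0, s[k - 1]) for k in range(i, j + 1))
            return
        if j - i + 1 < l:              # step 2: interval shorter than l
            parse(i, j, f(l))
            return
        src = first_occurrence(j - l + 1, l)
        if src is not None:            # step 3: suffix of length l
            parse(i, j - l, l)
            out.append((src, l))
            return
        for k in range((j - i + 1) // l):
            start = i + k * l
            src = first_occurrence(start, l)
            if src is not None:        # step 4: k-th aligned block
                parse(i, start - 1, f(l))
                out.append((src, l))
                parse(start + l, j, l)
                return
        parse(i, j, f(l))              # step 5: shrink the length

    parse(1, len(s), len(s))
    return out
```

Calling `approx_parse("ababbabbaabbabbaababa", 4)` reproduces the six phrases of the example. The sketch relies on Python recursion, so it is only suitable for short strings.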

It is not difficult to verify that our algorithm terminates and, assuming we 
find the first occurrences of substrings correctly, produces a valid encoding. The 
interesting problems are, first, to show how we can easily determine where substrings
first occur in $S$ and, second, to analyse the algorithm's performance. Notice that if
we make a recursive call on an interval $S[i..j]$, the only substrings of length $\ell$ we
ever consider in that interval are in

\[
\{S[i + k\ell..i + (k + 1)\ell - 1] : 0 \le k \le \lfloor (j - i + 1)/\ell \rfloor - 1\} \;\cup\;
\{S[j - (k + 1)\ell + 1..j - k\ell] : 0 \le k \le \lfloor (j - i + 1)/\ell \rfloor - 1\} .
\]



If we compute the Karp-Rabin fingerprints [3] of those substrings and store 
them in a membership data structure, such as a hash table, then we can search for 
their first occurrences by passing a sliding window of length $\ell$ over $S$, maintaining
the fingerprint of the window and performing a membership query for it at each
step. In fact, if we keep the branches of the recursion synchronized properly,
we can use a single membership data structure and a single pass of a sliding
window to find the first occurrences of all the substrings of length $\ell$ we ever
consider while parsing S. Of course, using Karp-Rabin fingerprints means our 
algorithm is randomized and could produce an invalid encoding, albeit with a 
very low probability. We can check an encoding by comparing each phrase to 
the earlier substring it should match. 
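A single sliding-window pass can be sketched as follows. This is a simplified illustration: it uses a Python dictionary in place of a perfect hash table, a fixed `base` and Mersenne-prime `mod` that we chose for reproducibility (a randomized implementation would draw the base at random), and the function name `first_occurrences` is ours. It assumes all patterns have the same length and that character codes are smaller than `base`.

```python
def first_occurrences(s, patterns, base=256, mod=(1 << 61) - 1):
    """Map each pattern (all of the same length l) to the 1-based position
    of the first window of s with the same fingerprint, or None."""
    l = len(patterns[0])

    def fingerprint(t):
        # Karp-Rabin fingerprint: t read as a base-`base` number modulo `mod`.
        h = 0
        for ch in t:
            h = (h * base + ord(ch)) % mod
        return h

    # Membership structure: fingerprint -> first position seen so far.
    wanted = {fingerprint(p): None for p in patterns}
    if l <= len(s):
        top = pow(base, l - 1, mod)    # weight of the window's first character
        h = fingerprint(s[:l])
        for start in range(len(s) - l + 1):
            if h in wanted and wanted[h] is None:
                wanted[h] = start + 1
            if start + l < len(s):
                # Slide the window: drop s[start], append s[start + l].
                h = ((h - ord(s[start]) * top) * base + ord(s[start + l])) % mod
    return {p: wanted[fingerprint(p)] for p in patterns}
```

For the length-9 substrings considered in the example, one pass reports that abbabbaab first occurs at position 3, while ababbabba and abbaababa first occur at their own starting positions 1 and 13. Because equal fingerprints do not guarantee equal strings, the reported positions should still be verified as described above.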

In our example, step 1 requires passes with sliding windows of length 21 and
16 before we can execute parse(1, 21, 21) and parse(1, 21, 16); step 2, with one of
length 12; step 3, with one of length 9; step 4 does not require a pass; steps 5 
and 15, with one of length 7; steps 6 and 16, with one of length 6; steps 7 and 
17, with one of length 5; step 8 does not require a pass; steps 9 and 18, with one 
of length 4; steps 10 and 19, with one of length 3; step 11, with one of length 2; 
steps 12, 13, 14 and 20 do not require a pass. 
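The window lengths for which passes may be needed can be enumerated directly from $f$. The following small sketch assumes $n^\epsilon$ is an integer of at least 2; the function name `pass_lengths` is ours.

```python
def pass_lengths(n, n_eps):
    """List the lengths n, f(n), f(f(n)), ..., 2 for
    f(l) = min(ceil(l * (1 - 1/n_eps)), l - 1)."""
    lengths = [n]
    while lengths[-1] > 2:
        l = lengths[-1]
        # Ceiling division in integer arithmetic avoids rounding errors.
        lengths.append(min(-(-l * (n_eps - 1) // n_eps), l - 1))
    return lengths
```

With `n = 21` and `n_eps = 4` this yields the lengths 21, 16, 12, 9, 7, 6, 5, 4, 3 and 2 used in the example.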

A practical optimization would be to build a table that stores the position
of the first occurrence of each substring of $S$ of length $f^{(k)}(n) < \log_\sigma M - c$,
where $\sigma$ is the alphabet size, $M$ is the size of the internal memory and $c$ is a
constant. Building this table takes $O(n \log_\sigma M)$ time, one pass over $S$ and a
$1/\sigma^c$ fraction of the internal memory. It allows us to avoid making passes with
sliding windows shorter than $\log_\sigma M - c$.


3 Time, I/O and Space Bounds 


Our algorithm’s time and I/O complexities are dominated by the time and I/Os
needed to make a pass over $S$ for each length $n, f(n), f(f(n)), \ldots, 2$. If we use
a perfect hash table for the fingerprints, the time for each pass is $O(n)$, and
calculation shows that the number of passes is $O(\log_{n^\epsilon/(n^\epsilon - 1)} n) = O(n^\epsilon \log n)$.
Therefore, we use $O(n^{1+\epsilon} \log n)$ total time and $O(n^{1+\epsilon} \log(n)/B)$ total I/Os.
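One way to carry out the calculation is sketched below, writing $x = 1/n^\epsilon$ and $\ell_t$ for the $t$-th window length; both symbols are ours, introduced only for this sketch.

```latex
\begin{align*}
\ell_0 = n, \qquad \ell_{t+1} = f(\ell_t)
  &\le \lceil \ell_t (1 - x) \rceil \le \ell_t (1 - x) + 1, \\
\ell_{t+1} - \tfrac{1}{x}
  &\le (1 - x)\left(\ell_t - \tfrac{1}{x}\right), \\
\ell_t - \tfrac{1}{x}
  &\le (1 - x)^t \left(n - \tfrac{1}{x}\right) \le e^{-xt}\, n .
\end{align*}
% After t = n^\epsilon \ln n passes we have e^{-xt} n = 1, so \ell_t \le n^\epsilon + 1.
% From then on f(\ell) \le \ell - 1, which adds at most n^\epsilon further passes.
\[
\#\text{passes} \;\le\; n^\epsilon \ln n + n^\epsilon + O(1) \;=\; O(n^\epsilon \log n).
\]
```

Equivalently, ignoring the rounding, $\log_{n^\epsilon/(n^\epsilon - 1)} n = \ln n / (-\ln(1 - x)) \le (\ln n)/x = n^\epsilon \ln n$, because $-\ln(1 - x) \ge x$ for $0 < x < 1$.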

The space complexity is dominated by the data structures for handling the 
recursion itself, and by the hash tables. The total size of the data structures for 
the recursion can be bounded in terms of the number of phrases in our parse, 
which we show in the next section to be 0{z/e). We need only store one hash 
table at a time, and its size is proportional to the number of substrings we are 
currently considering. We consider a substring only if all the larger substrings
overlapping it do not occur earlier in $S$, meaning they cross or end at a boundary
between phrases in the LZ77 parse. It follows that each substring of length $\ell$ we
consider must be within distance $O(\ell)$ of a phrase boundary. Therefore, we
consider $O(z)$ substrings of any given length.




4 Approximation Ratio 


Consider how many phrases in our parse can overlap a phrase $S[i..j]$ in the LZ77
parse. The longest phrase we use that overlaps $S[i..j]$ either completely covers
$S[i..j]$, covers a proper prefix of it, covers a proper suffix of it, or is properly
contained in it. The first case is trivial; the second and third cases are essentially
symmetric, as we are left trying to bound the number of our phrases that can
overlap either a proper suffix or a proper prefix of $S[i..j]$ that is immediately
preceded or followed by one of our phrases; and the fourth case is a combination
of the second and third, as we are left trying to bound the number of our phrases
that can overlap a proper prefix and a proper suffix of $S[i..j]$ that are separated
by one of our phrases.

Without loss of generality, we consider only the second case. That is, assume
$S[i..i' - 1]$ is covered by one of our phrases, for some $i' > i$, and consider how many
of our phrases can overlap $S[i'..j]$. By inspection of our algorithm, the longest
phrase we use that overlaps $S[i'..j]$ either covers a suffix of it, say $S[j' + 1..j]$, or
starts at $i'$ and covers at least a $1 - 1/n^\epsilon$ fraction of $S[i'..j]$.

In the first case now, the longest phrase we use that overlaps $S[i'..j']$ must
end at $j'$ and cover at least a $1 - 1/n^\epsilon$ fraction of $S[i'..j']$. The second case
is essentially the same as what we just considered. It follows that the number of
our phrases that can overlap an LZ77 phrase is $O(\log_{n^\epsilon} n) = O(1/\epsilon)$, so we use
$O(z/\epsilon)$ phrases in total.
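The last step can be spelled out as follows, where $r_t$ (our notation) denotes the length of the part of the LZ77 phrase still to be covered after $t$ of our phrases have been accounted for.

```latex
\begin{align*}
r_0 \le n, \qquad r_{t+1} \le \frac{r_t}{n^\epsilon}
  \quad&\Longrightarrow\quad r_t \le n^{1 - \epsilon t}, \\
r_t < 1 \ \text{once}\ t > \frac{1}{\epsilon},
  \qquad &\log_{n^\epsilon} n = \frac{\ln n}{\epsilon \ln n} = \frac{1}{\epsilon} .
\end{align*}
```

Each step leaves at most a $1/n^\epsilon$ fraction uncovered, so $O(1/\epsilon)$ of our phrases suffice on each side of the longest overlapping phrase.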

We can summarize the previous section and this one so far as follows: given
$\epsilon > 0$ and read-only access to a string $S[1..n]$ whose LZ77 parse consists of
$z$ phrases, with high probability we can build an LZ77-like parse consisting of
$O(z/\epsilon)$ phrases using $O(n^{1+\epsilon} \log n)$ time, $O(n^{1+\epsilon} \log(n)/B)$ I/Os and $O(z/\epsilon)$
space. Notice that, if we divide $\epsilon$ by 2 before running our algorithm, then the
time and I/O bounds are $O(n^{1+\epsilon/2} \log n) = O(n^{1+\epsilon})$ and $O(n^{1+\epsilon/2} \log(n)/B) =
O(n^{1+\epsilon}/B)$, so we get the slightly neater results stated in the introduction.

Theorem 1. Given a positive $\epsilon < 1$ and read-only access to a string $S[1..n]$
whose LZ77 parse consists of $z$ phrases, with high probability we can build an
LZ77-like parse of $S$ that consists of $O(z/\epsilon)$ phrases using $O(n^{1+\epsilon})$ time,
$O(n^{1+\epsilon}/B)$ I/Os and $O(z/\epsilon)$ space.